Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages

Authors

Hercules Dalianis

Martin Rimka

Viggo Kann

Abstract

This paper presents how we adapted a website search engine for cross-language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by visitors to the website of the Nordic Council, which contains five different languages. To compare how well different types of bilingual dictionaries covered the most common queries and terms on the website, we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary, and a trilingual dictionary constructed automatically from the website's news corpus using Uplug. The precision and recall of the automatically constructed Swedish-English dictionary were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS tags improve precision. The collection of ordinary dictionaries, consisting of about 200 000 words, covers only 41 of the top 100 search queries at the website. The automatically built trilingual dictionary combined with the small manually built trilingual dictionary, consisting of about 2 300 words, covers 36 of the top search queries.
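The abstract reports precision, recall, and query coverage without stating how they were computed. The following is a minimal sketch of one common way to compute such figures, assuming the standard set-based definitions against a gold-standard dictionary. The function names and the tiny Swedish-English word pairs are hypothetical and serve only as illustration; they are not taken from the paper or from Uplug.

```python
def precision_recall(proposed, gold):
    """Set-based precision and recall of proposed translation pairs.

    Assumption: a proposed pair counts as correct only if the exact
    (source, target) pair also appears in the gold standard.
    """
    proposed, gold = set(proposed), set(gold)
    correct = proposed & gold
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


def query_coverage(top_queries, headwords):
    """Return the queries that appear as dictionary headwords (case-insensitive)."""
    known = {w.lower() for w in headwords}
    return [q for q in top_queries if q.lower() in known]


# Hypothetical data for illustration only.
proposed_pairs = {
    ("arbete", "work"),
    ("skatt", "tax"),
    ("flytta", "move"),
    ("pension", "pencil"),   # a deliberately wrong alignment
}
gold_pairs = {
    ("arbete", "work"),
    ("skatt", "tax"),
    ("flytta", "move"),
    ("pension", "pension"),
    ("studera", "study"),
}

precision, recall = precision_recall(proposed_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")

top_queries = ["skatt", "pension", "Arbete", "bostad"]
headwords = {source for source, _ in proposed_pairs}
covered = query_coverage(top_queries, headwords)
print(f"{len(covered)} of {len(top_queries)} top queries covered: {covered}")
```

Under these assumptions, coverage only checks whether a query matches a headword; it says nothing about whether the stored translation is correct, which is why a dictionary can cover a query and still lower precision.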

Similar resources

Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages

Advertising Keyword Suggestion Using Relevance-Based Language Models from Wikipedia Rich Articles

When emerging technologies such as Search Engine Marketing (SEM) face tasks that require human level intelligence, it is inevitable to use the knowledge repositories to endow the machine with the breadth of knowledge available to humans. Keyword suggestion for search engine advertising is an important problem for sponsored search and SEM that requires a goldmine repository of knowledge. A recen...

Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST's Lemmatiser

The Euroling stemmer is developed for a commercial web site and intranet search engine called SiteSeeker. SiteSeeker is basically used in the Swedish domain but to some extent also for the English domain. CST’s lemmatiser comes from the Center for Language Technology, University of Copenhagen and was originally developed as a research prototype to create lemmatisation rules from training data. ...

Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages

Hallå Norden is a web site with information regarding mobility between the Nordic countries in five different languages; Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for the use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus. From this we extr...

To search and summarize in Scandinavia

Automatic text summarization is the method where a computer summarizes a text. A text is given to the computer and it returns a non-redundant shorter text. Text summarization can be used to summarize news in the Business Intelligence domain, automatically edit news in the news paper setting domain and summarize news down to a length suitable for SMS and WAP but also to summarize news before the...

Publication date: 2007
